THORN: Temporal Human-Object Relation Network for Action Recognition
Most action recognition models treat human activities as unitary events. However, human activities often follow a certain hierarchy: many are compositional, and most involve human-object interactions. In this paper, we propose to recognize human actions by leveraging the set of interactions that defines an action. We present an end-to-end network, THORN, that leverages important human-object and object-object interactions to predict actions. The model is built on top of a 3D backbone network. Its key components are: 1) an object representation filter for modeling objects; 2) an object relation reasoning module to capture object relations; and 3) a classification layer to predict the action labels. To show the robustness of THORN, we evaluate it on EPIC-Kitchens 55 and EGTEA Gaze+, two of the largest and most challenging first-person human-object interaction datasets. THORN achieves state-of-the-art performance on both datasets.
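The abstract does not spell out the exact form of the object relation reasoning module; as a rough illustration only, a minimal pairwise relation module in the spirit of relation networks could look like the following sketch (the `relation_reasoning` helper, all weights and all shapes are hypothetical, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def relation_reasoning(objects, w_rel, w_out):
    """Minimal pairwise relation module (hypothetical sketch).

    objects: (N, D) object features from the representation filter.
    Each ordered pair (i, j) is concatenated, projected, and passed
    through a ReLU; all relations are summed into one action feature.
    """
    n, d = objects.shape
    pair_sum = np.zeros(w_rel.shape[1])
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            pair = np.concatenate([objects[i], objects[j]])  # (2D,)
            pair_sum += np.maximum(pair @ w_rel, 0.0)        # relation MLP
    return pair_sum @ w_out  # logits over action classes

objects = rng.standard_normal((4, 8))   # 4 detected objects, 8-dim each
w_rel = rng.standard_normal((16, 32))
w_out = rng.standard_normal((32, 10))   # 10 hypothetical action classes
logits = relation_reasoning(objects, w_rel, w_out)
print(logits.shape)  # (10,)
```

In a full model, the summed relation feature would be fused with the 3D backbone feature before classification.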
ICE: Inter-instance Contrastive Encoding for Unsupervised Person Re-identification
Unsupervised person re-identification (ReID) aims at learning discriminative
identity features without annotations. Recently, self-supervised contrastive
learning has gained increasing attention for its effectiveness in unsupervised
representation learning. The main idea of instance contrastive learning is to
match the same instance across different augmented views. However, the relationship
between different instances of the same identity has not been explored in
previous methods, leading to sub-optimal ReID performance. To address this
issue, we propose Inter-instance Contrastive Encoding (ICE) that leverages
inter-instance pairwise similarity scores to boost previous class-level
contrastive ReID methods. We first use pairwise similarity ranking as one-hot
hard pseudo labels for hard instance contrast, which aims at reducing
intra-class variance. Then, we use similarity scores as soft pseudo labels to
enhance the consistency between augmented and original views, which makes our
model more robust to augmentation perturbations. Experiments on several
large-scale person ReID datasets validate the effectiveness of our proposed
unsupervised method ICE, which is competitive even with supervised methods.
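The soft pseudo-label idea above can be sketched numerically: similarity scores of the original view against a feature memory define a soft target distribution that the augmented view is trained to match. The function names, the memory bank and the temperature below are hypothetical simplifications, not ICE's actual implementation:

```python
import numpy as np

def softmax(x, t=0.1):
    z = np.exp((x - x.max()) / t)  # temperature-scaled, shifted for stability
    return z / z.sum()

def soft_consistency_loss(feat_orig, feat_aug, memory, t=0.1):
    """Soft pseudo-label consistency (hypothetical sketch).

    Similarity scores of the original view against a feature memory
    serve as soft targets; the augmented view should reproduce that
    distribution (cross-entropy between the two)."""
    p_orig = softmax(memory @ feat_orig, t)   # soft pseudo labels
    p_aug = softmax(memory @ feat_aug, t)
    return -np.sum(p_orig * np.log(p_aug + 1e-12))

rng = np.random.default_rng(0)
memory = rng.standard_normal((100, 32))
memory /= np.linalg.norm(memory, axis=1, keepdims=True)
f = rng.standard_normal(32); f /= np.linalg.norm(f)
noise = 0.05 * rng.standard_normal(32)
f_aug = (f + noise) / np.linalg.norm(f + noise)

loss_close = soft_consistency_loss(f, f_aug, memory)
loss_self = soft_consistency_loss(f, f, memory)
print(loss_self <= loss_close)  # identical views give the lowest loss
```

Cross-entropy is minimized when the two distributions coincide, which is why augmentation perturbations are penalized in proportion to how much they shift the similarity distribution.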
An APRIORI-based Method for Frequent Composite Event Discovery in Videos
We propose a method for the discovery of composite events in videos. The algorithm processes a set of primitive events, such as simple spatial relations between objects obtained from a tracking system, and outputs frequent event patterns which can be interpreted as frequent composite events. We use the APRIORI algorithm from the field of data mining for efficient detection of frequent patterns, adapting it to handle temporal uncertainty in the data without losing its computational efficiency. The method is formulated as a generic framework in which context knowledge is clearly separated from the method itself, in the form of a similarity measure for comparison between two video activities and a library of primitive events serving as a basis for the composite events.
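For reference, the textbook APRIORI loop over sets of primitive events looks like the sketch below; the temporal-uncertainty adaptation described in the abstract is not reproduced here, and the event names are invented for illustration:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Classic APRIORI frequent-itemset mining (textbook sketch).

    transactions: list of sets of primitive events per video segment.
    Returns a dict mapping frequent itemsets (frozensets) to counts.
    """
    def count(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([e]) for t in transactions for e in t}
    frequent = count(items)
    result = dict(frequent)
    k = 2
    while frequent:
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori pruning: drop candidates with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result

videos = [
    {"near(person,car)", "enters(person,car)", "stops(car)"},
    {"near(person,car)", "enters(person,car)"},
    {"near(person,car)", "stops(car)"},
]
patterns = apriori(videos, min_support=2)
print(patterns[frozenset({"near(person,car)", "enters(person,car)"})])  # 2
```

A frequent pair such as {near(person,car), enters(person,car)} is what the paper would interpret as a composite event.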
Partition and Reunion: A Two-Branch Neural Network for Vehicle Re-identification
The smart city vision raises the prospect that cities will become more intelligent in various fields, offering a more sustainable environment and a better quality of life for residents. As a key component of smart cities, intelligent transportation systems highlight the importance of vehicle re-identification (Re-ID). However, compared to the rapid progress on person Re-ID, vehicle Re-ID advances at a relatively slow pace. Some previous state-of-the-art approaches strongly rely on extra annotation, like attributes (e.g., vehicle color and type) and key-points (e.g., wheels and lamps). Recent work on person Re-ID shows that extracting more local features can achieve better performance without considering extra annotation. In this paper, we propose an end-to-end trainable two-branch Partition and Reunion Network (PRN) for the challenging vehicle Re-ID task. Utilizing only identity labels, our proposed method outperforms existing state-of-the-art methods by a large margin on four vehicle Re-ID benchmark datasets: VeRi-776, VehicleID, VRIC and CityFlow-ReID.
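The partition-then-pool idea behind part-based Re-ID branches can be sketched as follows; the stripe count, shapes and helper names are hypothetical and much simpler than PRN's actual two-branch design:

```python
import numpy as np

def partition_and_pool(feat_map, n_parts):
    """Partition branch (hypothetical sketch): split the feature map
    into horizontal stripes and average-pool each into a local feature.

    feat_map: (C, H, W) convolutional feature map.
    Returns (n_parts, C) stacked part features.
    """
    stripes = np.array_split(feat_map, n_parts, axis=1)  # split along H
    return np.stack([s.mean(axis=(1, 2)) for s in stripes])

def two_branch_descriptor(feat_map, n_parts=4):
    """Concatenate a global average-pooled feature with the flattened
    part features from the partition branch."""
    global_feat = feat_map.mean(axis=(1, 2))             # (C,)
    parts = partition_and_pool(feat_map, n_parts)        # (n_parts, C)
    return np.concatenate([global_feat, parts.ravel()])

rng = np.random.default_rng(0)
fmap = rng.standard_normal((256, 16, 8))                 # C=256, H=16, W=8
desc = two_branch_descriptor(fmap)
print(desc.shape)  # (1280,) = 256 global + 4*256 part features
```

The local stripes capture part-level cues (wheels, lamps, windshield region) that the global feature alone can miss.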
Deep-Temporal LSTM for Daily Living Action Recognition
In this paper, we propose to improve the traditional use of RNNs by employing
a many-to-many model for video classification. We analyze the importance of
modeling spatial layout and temporal encoding for daily living action
recognition. Many RGB methods focus only on short term temporal information
obtained from optical flow. Skeleton based methods on the other hand show that
modeling long term skeleton evolution improves action recognition accuracy. In
this work, we propose a deep-temporal LSTM architecture which extends standard
LSTM and allows better encoding of temporal information. In addition, we
propose to fuse 3D skeleton geometry with deep static appearance. We validate
our approach on the publicly available CAD60, MSRDailyActivity3D and NTU-RGB+D
datasets, achieving competitive performance compared to the state-of-the-art.
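The many-to-many idea, emitting a prediction at every time step instead of only after the last frame, can be sketched with a single unrolled LSTM cell; all parameters and dimensions below are hypothetical, and the deep-temporal extension of the paper is not reproduced:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_many_to_many(x_seq, w, u, b, w_out):
    """Many-to-many LSTM sketch (hypothetical parameters): emit an
    action prediction at every time step and average them, instead of
    classifying from the final hidden state only."""
    hdim = u.shape[0]
    h = np.zeros(hdim)
    c = np.zeros(hdim)
    step_logits = []
    for x in x_seq:
        z = x @ w + h @ u + b                 # all four gates at once
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g                     # cell state update
        h = o * np.tanh(c)                    # hidden state
        step_logits.append(h @ w_out)         # per-step prediction
    return np.mean(step_logits, axis=0)       # average over time

rng = np.random.default_rng(0)
T, D, H, K = 20, 12, 16, 5                    # frames, input, hidden, classes
w = rng.standard_normal((D, 4 * H)) * 0.1
u = rng.standard_normal((H, 4 * H)) * 0.1
b = np.zeros(4 * H)
w_out = rng.standard_normal((H, K)) * 0.1
logits = lstm_many_to_many(rng.standard_normal((T, D)), w, u, b, w_out)
print(logits.shape)  # (5,)
```

Supervising every time step gives the recurrent model a training signal throughout the sequence rather than only at its end.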
Cross domain Residual Transfer Learning for Person Re-identification
This paper presents a novel way to transfer model weights from one domain to another using a residual learning framework instead of direct fine-tuning. It also argues for hybrid models that use learned (deep) features together with statistical metric learning for multi-shot person re-identification when training sets are small. This is in contrast to popular end-to-end neural network based models, or models that use hand-crafted features with adaptive matching models (neural nets or statistical metrics). Our experiments demonstrate that a hybrid model with residual transfer learning can yield significantly better re-identification performance than an end-to-end model when the training set is small. On the iLIDS-VID [42] and PRID [15] datasets, we achieve rank-1 recognition rates of 89.8% and 95%, respectively, a significant improvement over the state-of-the-art.
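The residual transfer idea, keeping the source model frozen and learning only a correction on top of its features, can be sketched as below; the linear form and all names are hypothetical stand-ins for the paper's actual residual branch:

```python
import numpy as np

def residual_transfer(x, f_source, w_res):
    """Residual transfer sketch (hypothetical form): instead of
    fine-tuning the source network directly, keep it frozen and learn
    a small residual correction added to its features."""
    base = f_source(x)                        # frozen source-domain feature
    residual = np.maximum(x @ w_res, 0.0)     # learned target-domain residual
    return base + residual

rng = np.random.default_rng(0)
w_src = rng.standard_normal((64, 32)) * 0.1   # frozen source projection
w_res = np.zeros((64, 32))                    # residual starts at zero
x = rng.standard_normal(64)

f = lambda v: v @ w_src
feat = residual_transfer(x, f, w_res)
print(np.allclose(feat, f(x)))  # True: zero residual reproduces the source model
```

Initializing the residual at zero means training starts exactly from the source model's behavior, which is what makes the scheme gentler than direct fine-tuning on a small target set.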
G3AN: Disentangling Appearance and Motion for Video Generation
Creating realistic human videos entails the challenge of being able to
simultaneously generate both appearance, as well as motion. To tackle this
challenge, we introduce G3AN, a novel spatio-temporal generative model,
which seeks to capture the distribution of high dimensional video data and to
model appearance and motion in a disentangled manner. The latter is achieved by
decomposing appearance and motion in a three-stream Generator, where the main
stream aims to model spatio-temporal consistency, whereas the two auxiliary
streams augment the main stream with multi-scale appearance and motion
features, respectively. An extensive quantitative and qualitative analysis
shows that our model systematically and significantly outperforms
state-of-the-art methods on the facial expression datasets MUG and UvA-NEMO, as
well as on the human action datasets Weizmann and UCF101. Additional analysis
on the learned latent representations confirms the successful decomposition of
appearance and motion. Source code and pre-trained models are publicly
available. Comment: CVPR 2020, project link https://wyhsirius.github.io/G3AN
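The two-noise disentanglement can be illustrated with a heavily simplified generator: one appearance code shared across frames and one motion code per frame feed two auxiliary streams that are merged in a main stream. This linear sketch is hypothetical; G3AN itself uses deep spatio-temporal convolutional streams:

```python
import numpy as np

def two_noise_generator(z_a, z_m, w_a, w_m, w_main):
    """Disentangled two-noise generator sketch (hypothetical, heavily
    simplified): an appearance code z_a and a per-frame motion code z_m
    feed two auxiliary streams whose features are merged in a main
    stream that produces the video."""
    f_a = np.tanh(z_a @ w_a)                  # appearance features, one set
    f_m = np.tanh(z_m @ w_m)                  # motion features, one per frame
    # main stream: broadcast appearance over time, combine with motion
    return (f_a[None, :] + f_m) @ w_main      # (T, pixels)

rng = np.random.default_rng(0)
T, Dz, H, P = 8, 16, 32, 64                   # frames, noise, hidden, pixels
z_a = rng.standard_normal(Dz)                 # one appearance code
z_m = rng.standard_normal((T, Dz))            # one motion code per frame
w_a = rng.standard_normal((Dz, H))
w_m = rng.standard_normal((Dz, H))
w_main = rng.standard_normal((H, P))
video = two_noise_generator(z_a, z_m, w_a, w_m, w_main)
print(video.shape)  # (8, 64)
```

Because appearance and motion enter through separate codes, fixing z_a while varying z_m changes the dynamics without changing the identity, which is the disentanglement the abstract describes.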
Statistics of Pairwise Co-occurring Local Spatio-Temporal Features for Human Action Recognition
The bag-of-words approach with local spatio-temporal features has become a popular video representation for action recognition in videos. Together, these techniques have demonstrated high recognition results for a number of action classes. Recent approaches have typically focused on capturing global statistics of features. However, existing methods ignore relations between features and thus may not be discriminative enough. Therefore, we propose a novel feature representation which captures statistics of pairwise co-occurring local spatio-temporal features. Our representation captures not only the global distribution of features but also focuses on geometric and appearance (both visual and motion) relations among the features. By calculating a set of bag-of-words representations with different geometrical arrangements among the features, we preserve an important association between appearance and geometric information. Using two benchmark datasets for human action recognition, we demonstrate that our representation enhances the discriminative power of features and improves action recognition performance.
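The core counting step, tallying pairs of visual words whose features lie close together in space and time, can be sketched as follows; the radius criterion, the word indices and all shapes are hypothetical simplifications of the paper's geometrical arrangements:

```python
import numpy as np

def pairwise_cooccurrence(words, positions, radius, vocab_size):
    """Pairwise co-occurrence statistics (hypothetical sketch): besides
    the plain bag-of-words histogram, count ordered pairs of visual
    words whose features lie within a spatio-temporal radius.

    words: (N,) visual-word index per local feature.
    positions: (N, 3) (x, y, t) location of each feature.
    Returns a (vocab_size, vocab_size) co-occurrence matrix.
    """
    co = np.zeros((vocab_size, vocab_size))
    for i in range(len(words)):
        for j in range(len(words)):
            if i != j and np.linalg.norm(positions[i] - positions[j]) <= radius:
                co[words[i], words[j]] += 1
    return co

words = np.array([0, 1, 1, 2])
pos = np.array([[0, 0, 0], [1, 0, 0], [9, 9, 9], [1, 1, 0]], dtype=float)
co = pairwise_cooccurrence(words, pos, radius=2.0, vocab_size=3)
print(int(co[0, 1]), int(co[0, 2]))  # word 0 co-occurs with nearby words 1 and 2
```

The flattened matrix can then be appended to the usual bag-of-words histogram, adding geometric relations to the purely global statistics.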
Video Covariance Matrix Logarithm for Human Action Recognition in Videos
In this paper, we propose a new local spatio-temporal descriptor for videos, and a new approach for action recognition in videos based on it. The new descriptor is called the Video Covariance Matrix Logarithm (VCML). The VCML descriptor is based on a covariance matrix representation, and it models relationships between different low-level features, such as intensity and gradient. We apply the VCML descriptor to encode appearance information of local spatio-temporal video volumes, which are extracted by Dense Trajectories. Then, we present an extensive evaluation of the proposed VCML descriptor with Fisher vector encoding and Support Vector Machines on four challenging action recognition datasets. We show that the VCML descriptor achieves better results than state-of-the-art appearance descriptors. Moreover, we show that the VCML descriptor carries information complementary to the HOG descriptor, and that their fusion gives a significant improvement in action recognition accuracy. Finally, we show that the VCML descriptor improves action recognition accuracy in comparison to the state-of-the-art Dense Trajectories, and that the proposed approach achieves superior performance to state-of-the-art methods.
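The covariance-plus-matrix-logarithm construction can be sketched directly: compute the covariance of low-level features over a video volume, then map the symmetric positive-definite matrix to Euclidean space via its logarithm (here through an eigendecomposition). The channel count and regularization are hypothetical choices, not the paper's settings:

```python
import numpy as np

def vcml_descriptor(features, eps=1e-6):
    """Covariance-log descriptor sketch (hypothetical simplification of
    VCML): covariance of low-level features (e.g. intensity, gradients)
    over a video volume, mapped to a vector via the matrix logarithm.

    features: (N, D) low-level feature vectors sampled in the volume.
    Returns the upper triangle of log(cov) as a flat vector.
    """
    cov = np.cov(features, rowvar=False) + eps * np.eye(features.shape[1])
    eigvals, eigvecs = np.linalg.eigh(cov)        # SPD: real spectrum
    log_cov = eigvecs @ np.diag(np.log(eigvals)) @ eigvecs.T
    iu = np.triu_indices(log_cov.shape[0])
    return log_cov[iu]

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 5))             # 5 low-level channels
desc = vcml_descriptor(feats)
print(desc.shape)  # (15,) = 5*(5+1)/2 upper-triangular entries
```

Taking the logarithm flattens the Riemannian manifold of covariance matrices so that the descriptors can be compared with ordinary Euclidean machinery such as Fisher vectors and linear SVMs.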